
[release-4.11] Bug 2118586: on-prem: improvements on resolv-prepender #3287

Conversation

openshift-cherrypick-robot

This is an automated cherry-pick of #3271

/assign mandre

Currently, a NetworkManager dispatcher script does not have the correct
SELinux permission to D-Bus chat with hostnamed. Work around the issue
using systemd-run.

See: https://bugzilla.redhat.com/show_bug.cgi?id=2111632

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
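For context, a minimal sketch of the systemd-run workaround described above; the hostnamectl call and the NEW_HOSTNAME variable are illustrative, not the literal resolv-prepender code:

```bash
# NetworkManager dispatcher scripts run under an SELinux domain that is not
# allowed to D-Bus chat with systemd-hostnamed, so a direct call such as
#   hostnamectl set-hostname "${NEW_HOSTNAME}"
# would be denied. Running it through systemd-run executes the command in a
# short-lived transient unit with a permitted context instead.
systemd-run --wait --collect -- hostnamectl set-hostname "${NEW_HOSTNAME}"
```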
If resolv-prepender takes more than the NetworkManager timeout (currently
90s), devices might fail to come up before we have had a chance to
process all possible events for a device. The script needs to account for
the different types of events, including both IPv4 and IPv6 events in the
dual-stack case, and overall take less than the NetworkManager timeout.

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
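As an illustration of the timeout budgeting described above (the numbers and the prepender_step helper are assumptions, not values from the actual script):

```bash
# Keep the worst-case time spent handling all dispatcher events for a device
# below NetworkManager's 90s dispatcher timeout. With dual stack a device can
# generate several events (up, dhcp4-change, dhcp6-change, ...), so the
# per-event budget is the total budget divided by the assumed event count.
NM_DISPATCHER_TIMEOUT=90
MAX_EVENTS_PER_DEVICE=4                                     # assumed worst case
PER_EVENT_TIMEOUT=$(( (NM_DISPATCHER_TIMEOUT - 10) / MAX_EVENTS_PER_DEVICE ))

# Bound each individual step by the per-event budget:
timeout "${PER_EVENT_TIMEOUT}s" prepender_step              # prepender_step is a stand-in
```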
node-ip can fail if a device is not yet ready to be bound to. Retry, but
keep the total time, accounting for all the events we need to attend to,
below the NetworkManager timeout (90s).

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
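A hypothetical retry loop matching that description; run_node_ip and the budget value are placeholders rather than the real invocation:

```bash
# Retry node-ip while the device may not yet be ready to bind to, but bound the
# total retry time so that, together with the other events we need to handle,
# we stay under NetworkManager's 90s timeout.
RETRY_BUDGET=15                           # assumed per-event budget in seconds
deadline=$(( SECONDS + RETRY_BUDGET ))
until run_node_ip; do                     # stand-in for the real node-ip call
    if (( SECONDS >= deadline )); then
        echo "node-ip still failing after ${RETRY_BUDGET}s, giving up" >&2
        break
    fi
    sleep 1
done
```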
Make resolv-prepender wait for nameservers in
/run/NetworkManager/resolv.conf in all cases, to avoid copying it to
/etc/resolv.conf without any nameservers.

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
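Roughly, that wait amounts to a poll loop like the following sketch (the interval and iteration count are assumptions):

```bash
# Before copying /run/NetworkManager/resolv.conf over /etc/resolv.conf, wait
# until it contains at least one nameserver entry so we never install an empty
# resolver configuration on the host.
NM_RESOLV_CONF=/run/NetworkManager/resolv.conf
for _ in $(seq 1 30); do
    grep -q '^nameserver ' "${NM_RESOLV_CONF}" 2>/dev/null && break
    sleep 1
done
```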
Without a properly configured resolv.conf, the openshift-dns CoreDNS pods
will fail to run. These pods have the default DNS policy and therefore use
the host's resolv.conf, which is whatever kubelet picked up when it started.

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
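To see why the host resolv.conf matters here, one can check the DNS policy of the openshift-dns pods; this is an illustrative command and the label selector is an assumption:

```bash
# Pods with dnsPolicy "Default" inherit the node's resolver configuration,
# i.e. whatever kubelet read from /etc/resolv.conf when it started.
oc -n openshift-dns get pods -l dns.operator.openshift.io/daemonset-dns=default \
    -o jsonpath='{.items[*].spec.dnsPolicy}'
```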
@openshift-ci
Contributor

openshift-ci bot commented Aug 16, 2022

@openshift-cherrypick-robot: Bugzilla bug 2105003 has been cloned as Bugzilla bug 2118586. Retitling PR to link against new bug.
/retitle [release-4.11] Bug 2118586: on-prem: improvements on resolv-prepender

In response to this:

[release-4.11] Bug 2105003: on-prem: improvements on resolv-prepender

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot changed the title [release-4.11] Bug 2105003: on-prem: improvements on resolv-prepender [release-4.11] Bug 2118586: on-prem: improvements on resolv-prepender Aug 16, 2022
@openshift-ci openshift-ci bot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Aug 16, 2022
@openshift-ci
Contributor

openshift-ci bot commented Aug 16, 2022

@openshift-cherrypick-robot: This pull request references Bugzilla bug 2118586, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

6 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.11.z) matches configured target release for branch (4.11.z)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)
  • dependent bug Bugzilla bug 2105003 is in the state VERIFIED, which is one of the valid states (VERIFIED, RELEASE_PENDING, CLOSED (ERRATA), CLOSED (CURRENTRELEASE))
  • dependent Bugzilla bug 2105003 targets the "4.12.0" release, which is one of the valid target releases: 4.12.0
  • bug has dependents

Requesting review from QA contact:
/cc @sunzhaohua2

In response to this:

[release-4.11] Bug 2118586: on-prem: improvements on resolv-prepender

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jcaamano
Contributor

/test e2e-metal-ipi-ovn-dualstack
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere
/test e2e-ovirt

@knobunc
Contributor

knobunc commented Aug 17, 2022

/approve

@jcaamano
Contributor

/assign @sinnykumari

@sinnykumari
Contributor

Should e2e-vsphere and e2e-openstack be green as well?

@cybertron
Member

/test e2e-openstack
/test e2e-vsphere
/lgtm
/label backport-risk-assessed

Ideally, yes.

@openshift-ci openshift-ci bot added the backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. label Aug 17, 2022
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 17, 2022
@sinnykumari
Contributor

e2e-vsphere and e2e-openstack are still failing. I am adding my approval and will leave this to the on-prem team to decide when this is ready to get merged. Feel free to remove the hold when this looks fine.
/hold
/approve
/test e2e-openstack
/test e2e-vsphere

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 18, 2022
@openshift-ci
Contributor

openshift-ci bot commented Aug 18, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cybertron, knobunc, openshift-cherrypick-robot, sinnykumari

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 18, 2022
@jcaamano
Contributor

I see no history of those jobs passing in this repo/branch.
I don't see any related issues for the e2e-vsphere job.
On the last e2e-openstack job the DNS operator is complaining, but that's because something else failed while it was still progressing. Past jobs don't seem to have any related issues. I checked some of the journals and all seems normal.

Triggering once more
/test e2e-openstack
/test e2e-vsphere

@jcaamano
Contributor

Great, the openstack job now failed with the all-too-familiar "Timed out waiting for node count (5) to equal or exceed machine count (6)".
There is still a chance this is unrelated.
I created a dummy PR to crosscheck: #3297

@jcaamano
Contributor

/test e2e-openstack
/test e2e-vsphere

@mandre
Member

mandre commented Aug 19, 2022

Great, the openstack job now failed with the all-too-familiar "Timed out waiting for node count (5) to equal or exceed machine count (6)". There is still a chance this is unrelated. I created a dummy PR to crosscheck: #3297

According to the machine log, bzlr9p1b-174af-v968k-worker-0-4jvtf failed validation with:

"errorMessage": "Machine validation failed: \nError getting a new instance service from the machine: Failed to get cloud from secret: Failed to get secrets from kubernetes api: Get \"https://172.30.0.1:443/api/v1/namespaces/openshift-machine-api/secrets/openstack-cloud-credentials\": dial tcp 172.30.0.1:443: i/o timeout - error from a previous attempt: read tcp 10.130.0.12:49270-\u003e172.30.0.1:443: read: connection reset by peer",
"errorReason": "InvalidConfiguration",

It looks like the API server was not reachable at the time the node made the request; DNS seemed to work, however, so this is apparently a different issue (most likely infra-related).

@jcaamano
Contributor

The last openstack job had problems on master-0: weird issues with openshift-sdn, and must-gather was not able to collect node logs. While this does not look very good, I still can't tie anything specific to these changes.

At least the vsphere job passed.

/test e2e-openstack

@mandre
Member

mandre commented Aug 19, 2022

The openstack job failed again with the same "Timed out waiting for node count (5) to equal or exceed machine count (6)" error, except this time machine sz3jc4kz-174af-qck5g-worker-0-9h59q is stuck in the Provisioned status.

Logs from the instance show the following error:

[  385.651927] overlayfs: failed to resolve '/var/lib/containers/storage/overlay/l/RNRNIUQQVY6AO6BZOTQXMWC4KJ': -2

I'm not quite sure what causes it.

@mandre
Member

mandre commented Aug 19, 2022

@mandre again the node count issue, no error but there are only 2 workers and both are worker-0, how's that?

That's because they were created by the worker-0 machineset. It's the convention we use in OpenStack, where the installer suffixes the machinesets with an index representing an AZ number. Nothing to worry about. If we had another AZ, we would have a worker-1 machineset and so on.

@jcaamano
Contributor

I am trying to run this on my own with cluster bot. In the meantime...

/test e2e-openstack

@jcaamano
Contributor

Cluster bot was able to launch the cluster with no issues.

/test e2e-openstack

@jcaamano
Contributor

On the last run, only some tests fail now:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/3287/pull-ci-openshift-machine-config-operator-release-4.11-e2e-openstack/1561664665711284224

Of those, many are tests I have also seen failing in the test PR #3297 job:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/3297/pull-ci-openshift-machine-config-operator-release-4.11-e2e-openstack/1560604743808585728

including
[sig-arch] events should not repeat pathologically
[bz-etcd][invariant] alert/etcdGRPCRequestsSlow should not be at or above info
[bz-etcd][invariant] alert/etcdMemberCommunicationSlow should not be at or above info

Then there are some tests that fail in one and not in the other, and vice versa.

It looks to me like this job is very sensitive to this infra, but I at least managed to get it to pass once on my test PR.

/test e2e-openstack

@mandre
Member

mandre commented Aug 22, 2022

OK, this last run looks better. We're getting a lot of etcd-related test failures on openstack recently due to the underlying infra, so the job failures aren't too alarming. I'd be willing to merge now if needed.

@cybertron
Member

I think we're all in agreement that the openstack failure is unlikely to be caused by this patch, so we can go ahead and merge without that job in this instance.

@sinnykumari
Contributor

Removing the hold as the e2e-openstack test failure is unrelated.
/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 22, 2022
@rbbratta
Contributor

/label cherry-pick-approved

@openshift-ci openshift-ci bot added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Aug 22, 2022
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 2 against base HEAD 54a105e and 8 for PR HEAD 5b18d21 in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 1 against base HEAD 54a105e and 7 for PR HEAD 5b18d21 in total

@openshift-ci
Contributor

openshift-ci bot commented Aug 23, 2022

@openshift-cherrypick-robot: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-disruptive | 5b18d21 | link | false | /test e2e-aws-disruptive |
| ci/prow/e2e-aws-upgrade-single-node | 5b18d21 | link | false | /test e2e-aws-upgrade-single-node |
| ci/prow/e2e-aws-single-node | 5b18d21 | link | false | /test e2e-aws-single-node |
| ci/prow/e2e-openstack | 5b18d21 | link | false | /test e2e-openstack |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit d33d8dc into openshift:release-4.11 Aug 23, 2022
@openshift-ci
Contributor

openshift-ci bot commented Aug 23, 2022

@openshift-cherrypick-robot: All pull requests linked via external trackers have merged:

Bugzilla bug 2118586 has been moved to the MODIFIED state.

In response to this:

[release-4.11] Bug 2118586: on-prem: improvements on resolv-prepender

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
